46 research outputs found

    Compiler-Injected SIHFT for Embedded Operating Systems

    Random hardware faults are a major concern for critical systems, especially when they are employed in high-radiation environments such as aerospace applications. While specialised hardware already exists for implementing fault tolerance, software solutions, known as Software-Implemented Hardware Fault Tolerance (SIHFT), offer higher flexibility at a lower cost. This work describes a compiler-based approach for inserting instruction-level fault detection mechanisms in both the application code and the operating system. An experimental evaluation on an STM32 board running FreeRTOS shows the effectiveness of the proposed approach in detecting faults.
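The instruction-level detection idea can be illustrated well above the level at which the compiler actually operates. The sketch below (all names invented, with Python standing in for compiler-injected machine instructions) duplicates a computation and compares the two results, in the spirit of EDDI-style instruction duplication:

```python
# Hypothetical sketch of duplication-based SIHFT: every computation is
# performed twice on independent copies of its inputs, and a mismatch
# between the primary and shadow results signals a transient fault.
# A real compiler-injected scheme duplicates at the register/instruction
# level; this decorator only mimics the check-and-compare pattern.

def duplicated(fn):
    def wrapper(*args):
        primary = fn(*args)
        shadow = fn(*args)  # shadow execution on duplicated state
        if primary != shadow:
            raise RuntimeError("fault detected: primary/shadow mismatch")
        return primary
    return wrapper

@duplicated
def accumulate(values):
    total = 0
    for v in values:
        total += v
    return total

print(accumulate([1, 2, 3]))  # 6
```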

    Enabling Software Technologies for Critical COTS-based Spacecraft Systems

    In this position article, we motivate the need to introduce three software methods in spacecraft computing platforms in order to enable the use of COTS components: SIHFT, mixed-criticality, and probabilistic timing analysis. We investigate the benefits and drawbacks of these techniques, especially in terms of safety, also analyzing the relevant standards to identify the current limitations that prevent such techniques from being used. Finally, we recap current and future work, highlighting possible changes to the standards.

    Software Fault Tolerance in Real-Time Systems: Identifying the Future Research Questions

    Tolerating hardware faults in modern architectures is becoming a prominent problem due to the miniaturization of hardware components, their increasing complexity, and the need to reduce costs. Software-Implemented Hardware Fault Tolerance approaches have been developed to improve system dependability against hardware faults without resorting to custom hardware solutions. However, they make it harder, from a scheduling standpoint, to satisfy the timing constraints of the applications. This paper surveys the current state of the art of fault tolerance approaches used in the context of real-time systems, identifying the main challenges and the cross-links between these two topics. We propose a joint scheduling-failure analysis model that highlights the formal interactions between software fault tolerance mechanisms and timing properties. This model allows us to present and discuss many open research questions, with the final aim of spurring future research.

    Poster Abstract: Run-time Dynamic WCET Estimation

    To guarantee the timing constraints of real-time IoT devices, engineers need to estimate the Worst-Case Execution Time (WCET). Such an estimate is always very pessimistic and represents a condition that almost never occurs in practice. In this poster, we present a novel compiler-based approach that instruments the tasks to inform the operating system, at run time, when non-worst-case branches are taken. The generated slack is then used to make better scheduling decisions.
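The branch-instrumentation idea can be sketched as follows. All names and budget figures here are invented for illustration: a task that skips a worst-case branch reports the unused portion of its WCET budget to the scheduler, which accumulates it as reclaimable slack:

```python
# Hypothetical sketch: a compiler-instrumented task donates unused WCET
# budget to the OS whenever a cheaper-than-worst-case branch is taken.

class SlackAccount:
    """Stands in for the OS-side accounting of reclaimed slack."""
    def __init__(self):
        self.slack_us = 0

    def release(self, us):
        self.slack_us += us

def task(sensor_ready, slack):
    wcet_branch_us = 120  # assumed budget of the worst-case branch
    if sensor_ready:
        pass  # expensive processing path: consumes the full budget
    else:
        # cheap path costs ~5 us; the rest of the budget becomes slack
        slack.release(wcet_branch_us - 5)

acct = SlackAccount()
task(False, acct)        # non-worst-case branch taken
print(acct.slack_us)     # 115
```

The scheduler could then use the accumulated slack to admit best-effort work or advance lower-priority tasks without endangering guarantees.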

    Probabilistic-WCET Reliability: Statistical Testing of EVT hypotheses

    In recent years, interest in probabilistic real-time computing has grown in response to the limitations of traditional static Worst-Case Execution Time (WCET) methods for the timing analysis of applications running on complex systems, such as multi-/many-core and COTS platforms. Probabilistic theory can partially solve this problem, but it requires strong guarantees on the execution time traces in order to provide safe probabilistic-WCET estimations. These requirements can be verified through suitable statistical tests, as described in this paper. In this work, we also identify challenges and problems of using statistical testing procedures in probabilistic real-time computing, proposing a unified test procedure based on a single index called the Probabilistic Predictability Index (PPI). An experimental campaign was carried out on both synthetic and realistic datasets, including an analysis of the impact of the Linux PREEMPT_RT patch on a modern complex platform as a use case for the proposed index.
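The single-index idea can be sketched in the abstract's spirit, without claiming to reproduce the actual PPI definition: several statistical checks on an execution-time trace are each mapped to [0, 1], and the index is their minimum, so one failing hypothesis is enough to flag the trace as unsuitable for probabilistic-WCET estimation. The two checks below (lag-1 autocorrelation and a first/second-half mean shift) are illustrative stand-ins, not the tests used in the paper:

```python
# Hypothetical "unified index" sketch: min over per-hypothesis scores.

def lag1_independence_score(xs):
    # 1.0 means no lag-1 autocorrelation (a weak proxy for independence)
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
    den = sum((x - m) ** 2 for x in xs)
    r1 = num / den
    return max(0.0, 1.0 - abs(r1))

def stationarity_score(xs):
    # 1.0 means the two halves of the trace have the same mean
    half = len(xs) // 2
    m1 = sum(xs[:half]) / half
    m2 = sum(xs[half:]) / (len(xs) - half)
    return max(0.0, 1.0 - abs(m1 - m2) / max(m1, m2))

def ppi_like_index(xs):
    return min(lag1_independence_score(xs), stationarity_score(xs))

trace = [100, 102, 99, 101, 100, 103, 98, 101]  # execution times (us)
print(round(ppi_like_index(trace), 3))
```

A trace would only be passed to EVT fitting if the index exceeds some acceptance threshold.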

    The MIG Framework: Enabling Transparent Process Migration in Open MPI

    This paper introduces the mig framework: an Open MPI extension to transparently support the migration of application processes across different nodes of a distributed High-Performance Computing (HPC) system. The framework provides mechanisms on top of which suitable resource managers can implement policies to react to hardware faults, address performance variability, improve resource utilization, and perform fine-grained load balancing and power/thermal management. Compared to other state-of-the-art approaches, the mig framework does not require changes in the application code. Moreover, it is highly maintainable, since it is mainly a self-contained solution that required very few changes in other existing Open MPI frameworks. Experimental results show that the proposed extension does not introduce significant overhead in the application execution, while the penalty of performing a migration can be properly taken into account by a resource manager.
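Transparent migration rests on freezing a process's state on one node and resuming it on another without the application noticing. The toy sketch below illustrates only that checkpoint/ship/restore pattern (all names invented; the actual mig framework operates inside the Open MPI runtime, not at the application level):

```python
# Hypothetical migration-by-checkpoint sketch: serialize process state on
# the source "node", transfer the bytes, and resume on the target.
import pickle

class Solver:
    """Stands in for an application process with in-memory state."""
    def __init__(self):
        self.iteration = 0

    def step(self):
        self.iteration += 1

def checkpoint(proc):
    return pickle.dumps(proc)   # freeze state on the source node

def restore(blob):
    return pickle.loads(blob)   # resume from the same state on the target

src = Solver()
for _ in range(3):
    src.step()

dst = restore(checkpoint(src))  # "migrated" process picks up where src left off
dst.step()
print(dst.iteration)  # 4
```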

    Non-Preemptive Scheduling of Periodic Mixed-Criticality Real-Time Systems

    In this work we develop an offline analysis of periodic mixed-criticality real-time systems. We develop a graph-based exploratory method to non-preemptively schedule tasks of multiple criticality levels. The exploration process obtains a schedule for each periodic instance of the tasks. The schedule adjusts for criticality mode changes to maximize resource usage by allowing lower-criticality executions, while ensuring that the schedulability of higher-criticality jobs is never compromised. We also quantify the probabilities associated with a criticality mode change by using the tasks' probabilistic Worst-Case Execution Times. A method to reduce the offline complexity is also proposed.
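A heavily simplified version of the exploratory idea can be sketched as a depth-first search over non-preemptive job orderings, keeping only orderings in which every job meets its deadline. The job set and parameters are invented, and criticality mode changes are omitted; this shows only the explore-and-prune skeleton:

```python
# Hypothetical sketch: DFS over non-preemptive orderings with memoization.
# jobs: name -> (release, wcet, deadline); returns a feasible ordering
# or None if the explored state is infeasible.

def explore(jobs, time=0, done=frozenset(), memo=None):
    if memo is None:
        memo = {}
    if len(done) == len(jobs):
        return []                       # all jobs placed: feasible leaf
    key = (time, done)
    if key in memo:
        return memo[key]
    result = None
    for name, (rel, wcet, dl) in jobs.items():
        if name in done:
            continue
        start = max(time, rel)          # non-preemptive: run to completion
        if start + wcet <= dl:          # prune branches that miss a deadline
            rest = explore(jobs, start + wcet, done | {name}, memo)
            if rest is not None:
                result = [name] + rest
                break
    memo[key] = result
    return result

jobs = {"hi_ctrl": (0, 3, 5), "lo_log": (0, 2, 10), "hi_io": (4, 2, 8)}
print(explore(jobs))  # ['hi_ctrl', 'lo_log', 'hi_io']
```

The paper's method additionally branches on criticality mode changes and weighs branches by the probabilistic WCETs, which this sketch does not attempt.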

    The Misconception of Exponential Tail Upper-Bounding in Probabilistic Real-Time

    Measurement-Based Probabilistic Timing Analysis, a probabilistic real-time computing method, is based on Extreme Value Theory (EVT), a statistical theory applied to Worst-Case Execution Time analysis of real-time embedded systems. The output of EVT is a statistical distribution, in the form of a Generalized Extreme Value or Generalized Pareto distribution. Its cumulative distribution function can asymptotically assume one of three possible forms: light, exponential, or heavy tail. Recently, several works proposed to upper-bound light-tail distributions with their exponential version. In this paper, we show that this assumption is valid only under certain conditions and that it is often misinterpreted. This leads to unsafe estimations of the worst-case execution time, which cannot be accepted in applications targeting safety-critical embedded systems.
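The quantities under discussion can be made concrete with a small numeric comparison. The snippet below evaluates the survival function of a light-tailed Generalized Pareto Distribution (shape xi < 0) against an exponential with the same scale; with these particular (invented) parameters the exponential curve dominates, but whether such dominance holds in general depends on how the exponential's rate is chosen relative to the fitted light tail, which is exactly the kind of condition the paper examines:

```python
# Hypothetical numeric illustration: GPD (light tail) vs exponential tail.
import math

def gpd_survival(x, xi, sigma):
    # P(X > x) for a Generalized Pareto Distribution with location 0
    if xi == 0:
        return math.exp(-x / sigma)     # xi = 0 reduces to the exponential
    base = 1 + xi * x / sigma
    return base ** (-1 / xi) if base > 0 else 0.0

def exp_survival(x, sigma):
    return math.exp(-x / sigma)

xi, sigma = -0.2, 1.0                   # xi < 0: light (bounded) tail
for x in (1.0, 2.0, 4.0):
    print(x, gpd_survival(x, xi, sigma), exp_survival(x, sigma))
```

Note that a light-tailed GPD has a finite right endpoint (here sigma / -xi = 5.0), whereas the exponential assigns positive probability to every threshold, which is what makes the "exponential upper bound" intuition plausible yet sensitive to the fitting conditions.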

    A Hierarchical Approach for Resource Management in Heterogeneous Systems

    Heterogeneous architectures are emerging as a dominant trend in HPC, mainly thanks to their high performance-per-watt ratio. Dealing with heterogeneity and task-based applications requires considering different aspects at both the infrastructure level and the single-node level in order to meet power, thermal, and performance requirements. Thus, to provide effective and fine-grained management of the available resources, as well as to balance the load by dispatching applications among the different computing nodes, we propose a hierarchical approach in which different resource managers, running on the nodes, collaborate to achieve a multi-objective optimization.

    Reliability-oriented resource management for High-Performance Computing

    Reliability is an increasingly pressing issue for High-Performance Computing systems, as failures are a threat to large-scale applications, for which even a single run may incur significant energy and billing costs. Currently, application developers need to address reliability explicitly, by integrating application-specific checkpoint/restore mechanisms. However, the application alone cannot exploit system-level knowledge, which a system-wide resource management system can. In this paper, we propose a reliability-oriented policy that can significantly increase component reliability by combining the exploitation of checkpoint/restore mechanisms with proactive resource management policies.
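One concrete way a system-wide manager can exploit knowledge the application lacks is by tuning the checkpoint interval to each node's observed failure rate. The abstract does not name a specific formula; the sketch below uses Young's classic first-order approximation, interval ≈ sqrt(2 * C * MTBF), purely as a plausible ingredient of such a policy:

```python
# Hypothetical sketch: checkpoint interval from node MTBF (Young's formula).
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    # Optimal interval between checkpoints, to first order, given the
    # cost C of writing one checkpoint and the node's mean time between
    # failures: interval = sqrt(2 * C * MTBF).
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# e.g. 60 s to write a checkpoint, node MTBF of 24 h
print(round(young_interval(60, 24 * 3600)))  # 3220 (seconds, ~54 min)
```

A resource manager observing a degrading node (falling MTBF) would shorten the interval, or proactively migrate the job, before the application itself sees any failure.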